Optimize DOM HTML serialization for UTF-8 #16376

Merged
merged 2 commits into from
Oct 22, 2024

Conversation

nielsdos
Member

@nielsdos nielsdos commented Oct 11, 2024

This patch adds a fast path to the HTML serialization encoding that has
to encode to UTF-8. Because the DOM internally represents all strings
using UTF-8, we only need to validate here.
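
To illustrate the idea (this is not the code from the patch), here is a minimal sketch of a validate-only pass: valid UTF-8 runs are copied through unchanged, and only invalid bytes trigger extra work. The names `utf8_seq_len` and `utf8_passthrough` are hypothetical, and replacing each invalid byte with its own U+FFFD is a simplifying assumption; the real serializer may group invalid bytes differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length (1..4) of the well-formed UTF-8 sequence
 * starting at p, or 0 if the bytes at p are invalid or truncated. */
static size_t utf8_seq_len(const unsigned char *p, const unsigned char *end)
{
    size_t avail = (size_t)(end - p);
    unsigned char c = p[0];

    if (c < 0x80) {
        return 1;
    }
    if (c >= 0xC2 && c <= 0xDF) {
        return (avail >= 2 && (p[1] & 0xC0) == 0x80) ? 2 : 0;
    }
    if (c >= 0xE0 && c <= 0xEF) {
        if (avail < 3 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80) return 0;
        if (c == 0xE0 && p[1] < 0xA0) return 0; /* overlong encoding */
        if (c == 0xED && p[1] > 0x9F) return 0; /* UTF-16 surrogate range */
        return 3;
    }
    if (c >= 0xF0 && c <= 0xF4) {
        if (avail < 4 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80
            || (p[3] & 0xC0) != 0x80) return 0;
        if (c == 0xF0 && p[1] < 0x90) return 0; /* overlong encoding */
        if (c == 0xF4 && p[1] > 0x8F) return 0; /* above U+10FFFF */
        return 4;
    }
    return 0; /* stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF */
}

/* Hypothetical fast path: the input is assumed to already be UTF-8, so valid
 * runs are flushed with a single memcpy and nothing is re-encoded.
 * Returns the number of bytes written, or (size_t)-1 if out is too small. */
static size_t utf8_passthrough(const unsigned char *in, size_t in_len,
                               unsigned char *out, size_t out_cap)
{
    static const unsigned char replacement[3] = {0xEF, 0xBF, 0xBD}; /* U+FFFD */
    const unsigned char *p = in, *end = in + in_len, *last_output = in;
    size_t written = 0;

    while (p < end) {
        size_t n = utf8_seq_len(p, end);
        if (n != 0) {
            p += n; /* valid: keep extending the current run */
            continue;
        }
        /* Invalid byte: flush the valid run before it, then the replacement. */
        size_t run = (size_t)(p - last_output);
        if (run + sizeof(replacement) > out_cap - written) return (size_t)-1;
        memcpy(out + written, last_output, run);
        written += run;
        memcpy(out + written, replacement, sizeof(replacement));
        written += sizeof(replacement);
        p++;
        last_output = p;
    }

    /* Flush the trailing valid run. */
    size_t run = (size_t)(p - last_output);
    if (run > out_cap - written) return (size_t)-1;
    memcpy(out + written, last_output, run);
    return written + run;
}
```

For fully valid input the loop only reads bytes and the whole buffer is emitted with one `memcpy`, which is where a saving over a general decode-then-encode path would come from.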

Tested on Wikipedia English home page on an i7-4790, serializing the page 1000 times:

Benchmark 1: ./sapi/cli/php x.php
  Time (mean ± σ):     516.0 ms ±   6.4 ms    [User: 511.2 ms, System: 3.5 ms]
  Range (min … max):   506.0 ms … 527.1 ms    10 runs

Benchmark 2: ./sapi/cli/php_old x.php
  Time (mean ± σ):     682.8 ms ±   6.5 ms    [User: 676.8 ms, System: 3.8 ms]
  Range (min … max):   675.8 ms … 695.6 ms    10 runs

Summary
  ./sapi/cli/php x.php ran
    1.32 ± 0.02 times faster than ./sapi/cli/php_old x.php

(And if you're interested: it takes over a second on my machine using the old DOMDocument class)

Future optimizations are certainly possible, but let's start here.

@nielsdos nielsdos changed the title Dom optimized serialize html Optimize DOM HTML serialization for UTF-8 Oct 11, 2024
@nielsdos nielsdos requested a review from Girgias October 20, 2024 20:05
Member

@Girgias Girgias left a comment

One small question, but looks fine to me

Comment on lines -548 to +552
- size_t skip = buf_ref - buf_ref_backup; /* Skip invalid data, it's replaced by the UTF-8 replacement bytes */
  if (!dom_process_parse_chunk(
      ctx,
      document,
      parser,
-     buf_ref - last_output - skip,
+     buf_ref_backup - last_output,
Member

Is this unrelated to the perf optimisation commit?

Member Author

The fast path has the same structure as this code, and I noticed the skip variable was useless. So yeah, it's more of a cleanup.
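
For readers following along, a small sketch of why `skip` was redundant: since `skip` is defined as `buf_ref - buf_ref_backup`, subtracting it from `buf_ref - last_output` always yields `buf_ref_backup - last_output`. The two function names below are hypothetical, and the pointer roles (`buf_ref_backup` as the position before the invalid data, `buf_ref` as the position after it) are an assumption based on the variable names and the comment in the diff.

```c
#include <stddef.h>

/* Chunk length as computed with the intermediate skip variable. */
static size_t chunk_len_with_skip(const char *last_output,
                                  const char *buf_ref_backup,
                                  const char *buf_ref)
{
    size_t skip = (size_t)(buf_ref - buf_ref_backup);
    return (size_t)(buf_ref - last_output) - skip;
}

/* Chunk length computed directly; buf_ref cancels out of the expression. */
static size_t chunk_len_direct(const char *last_output,
                               const char *buf_ref_backup)
{
    return (size_t)(buf_ref_backup - last_output);
}
```

Both forms give the same length for any `last_output <= buf_ref_backup <= buf_ref`, so dropping the variable does not change behavior.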

@nielsdos nielsdos merged commit 935fef2 into php:master Oct 22, 2024
9 of 10 checks passed
2 participants